---
title: "Assignment 3: ECDF and Bootstrap Sampling and Applications"
author: "Charlie Morgan"
date: "Due: 02/16/26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of contents */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, your code may produce warning messages;
                                         # you can choose whether to include them in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```
 
 \
 
## **Assignment Objectives** 

* Understand the theoretical basis of Bootstrap sampling methods for approximating sampling distributions.

* Assess the performance of Bootstrap sampling distributions against exact and asymptotic sampling distributions.

* Implement Bootstrap sampling algorithm and construct sampling distributions using R.

\

## **Question 1: Asymptotic vs Bootstrap Sampling Distributions**

Write an essay summarizing the concepts of Asymptotic and Bootstrap Sampling Distributions, along with their key applications. Your discussion should be grounded in your personal understanding of the material. Any external sources including AI tools consulted must be clearly cited. 


**Essay Prompt**: Discuss the concepts of the bootstrap sampling plan, the bootstrap sampling distribution, and the asymptotic sampling distribution in the context of statistics (e.g., sample mean and variance) computed from an independent and identically distributed (i.i.d.) sample. Your discussion should:

* Clearly outline the key assumptions required for each method.

* Explain the practical application of each distribution.

* Provide guidance on when and why one should be preferred over the other in statistical inference.

### Asymptotic Sampling Distributions

An asymptotic sampling distribution describes the limiting behavior of a statistic’s sampling distribution as the sample size $n \to \infty$. In other words, if you repeatedly take samples of size $n$ and compute a statistic each time, the distribution of that statistic (after suitable centering and scaling) approaches a specific limiting distribution as $n$ becomes very large.

The most common example is the sample mean. Suppose we have observations $X_1, X_2, \ldots, X_n$ that are independent and identically distributed (i.i.d.) from some population with mean $\mu$ and variance $\sigma^2$. Then by the Central Limit Theorem, 
$$
\bar{X} \stackrel{d}{\sim} N\left(\mu,  \frac{\sigma^2}{n}\right) \quad \text{for large } n,
$$
where the second parameter is the variance, so the standard error of $\bar{X}$ is $\sigma/\sqrt{n}$.
The assumptions here are:

* The observations must be independent.

* The observations must be identically distributed.

* The population must have a finite mean and finite variance.

* The sample size must be sufficiently large for the normal approximation to be accurate.

This is extremely useful because once we know the distribution of the statistic, we can use it to construct confidence intervals, perform hypothesis tests, and compute probabilities.
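
As a rough check of the result above, here is a small simulation sketch. The Exponential(1) population (so $\mu = 1$ and $\sigma = 1$), the sample size, and the number of repetitions are my own illustrative assumptions, not part of the assignment.

```{r}
# Sketch: check the CLT for the sample mean by simulation (base R only).
# Assumption: an Exponential(rate = 1) population, so mu = 1 and sigma = 1.
set.seed(1)
n_obs  <- 50     # size of each sample
n_reps <- 5000   # number of repeated samples

# Each column is one i.i.d. sample; colMeans() gives one sample mean per column.
sim_means <- colMeans(matrix(rexp(n_obs * n_reps, rate = 1), nrow = n_obs))

# Compare the simulated centre and spread with the CLT values mu and sigma / sqrt(n).
clt_check <- c(
  sim_mean = mean(sim_means),
  clt_mean = 1,
  sim_se   = sd(sim_means),
  clt_se   = 1 / sqrt(n_obs)
)
print(round(clt_check, 3))

# A 95% normal-theory confidence interval for mu built from one new sample.
one_sample <- rexp(n_obs, rate = 1)
ci_95 <- mean(one_sample) + c(-1, 1) * qnorm(0.975) * sd(one_sample) / sqrt(n_obs)
print(round(ci_95, 3))
```

If the approximation is working, the simulated mean and standard error should sit close to their CLT counterparts.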

There are similar asymptotic results for other statistics. For example, the sample proportion, when based on independent Bernoulli trials with probability of success $p$, is approximately distributed as
$$
\hat{p} \stackrel{d}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \quad \text{for large } n
$$
The assumptions for this result are:

* The trials are independent.

* Each trial has the same probability of success $p$.

* Both $np$ and $n(1-p)$ are sufficiently large to justify the normal approximation.
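
The proportion result can be checked the same way. This is only a sketch: the values `n_trials = 200` and `p_true = 0.3` are made up for illustration, and the interval at the end is the usual Wald interval built from the normal approximation.

```{r}
# Sketch: normal approximation for a sample proportion (base R only).
# Assumptions: 200 Bernoulli trials with true success probability 0.3 (both made up).
set.seed(2)
n_trials <- 200
p_true   <- 0.3

# Rule-of-thumb check that np and n(1 - p) are not small.
print(c(np = n_trials * p_true, nq = n_trials * (1 - p_true)))

# Simulate many sample proportions and compare their spread with sqrt(p(1 - p) / n).
p_hats <- rbinom(5000, size = n_trials, prob = p_true) / n_trials
prop_check <- c(sim_se = sd(p_hats),
                clt_se = sqrt(p_true * (1 - p_true) / n_trials))
print(round(prop_check, 4))

# Wald 95% interval from the first simulated sample.
p_hat   <- p_hats[1]
wald_ci <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n_trials)
print(round(wald_ci, 3))
```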

There is also an asymptotic result for the sample variance. If the population has a finite fourth central moment $\mu_4 = E\left[(X - \mu)^4\right]$, then for large $n$ the sample variance $S^2$ is approximately normally distributed:
$$
S^2 \stackrel{d}{\sim} N\left(\sigma^2, \frac{\mu_4 - \sigma^4}{n}\right) \quad \text{for large } n.
$$
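
To see where the variance term comes from in practice, the following sketch simulates sample variances. It assumes an Exponential(1) population, for which $\sigma^2 = 1$ and the fourth central moment is $\mu_4 = 9$; the population and the sizes used are my own choices for illustration.

```{r}
# Sketch: check Var(S^2) ~ (mu_4 - sigma^4) / n by simulation (base R only).
# Assumption: Exponential(rate = 1) population, where sigma^2 = 1 and mu_4 = 9.
set.seed(3)
n_obs  <- 200
n_reps <- 5000

samples  <- matrix(rexp(n_obs * n_reps, rate = 1), nrow = n_obs)
sim_vars <- apply(samples, 2, var)   # one sample variance per column

var_check <- c(
  sim_mean = mean(sim_vars),          # should be close to sigma^2 = 1
  sim_sd   = sd(sim_vars),            # simulated standard error of S^2
  asym_sd  = sqrt((9 - 1^2) / n_obs)  # sqrt((mu_4 - sigma^4) / n)
)
print(round(var_check, 3))
```
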
Overall, asymptotic sampling distributions describe how statistics behave as the sample size grows and provide an easy way to perform statistical inference when exact distributions are hard to come by.

### Bootstrap Sampling Distributions

Asymptotic theory shows that a statistic approximately follows a known distribution when $n$ is large. But what if $n$ is not large? Or what if your statistic is something like the median, a percentile, or some other statistic whose sampling distribution has no simple formula? This is where bootstrap sampling comes into play.

Bootstrap sampling is used to approximate the sampling distribution of a statistic with fewer assumptions. The bootstrap uses the data itself, through the empirical distribution of the sample, as a stand-in for the population. Suppose you observe a sample: $X_1, X_2, \ldots, X_n$. You first compute whatever statistic you are interested in. Then you repeatedly draw $n$ observations at random with replacement from the original sample. Each resample forms a new “bootstrap sample.” For each bootstrap sample, you recompute the statistic. After repeating this process many times (often thousands), you obtain a distribution of bootstrap statistics. This empirical distribution is then used as an approximation to the statistic’s true sampling distribution.
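
The sampling plan described above can be sketched in a few lines of base R. The sample `x_obs` is simulated and purely hypothetical, and the median is used only because it has no simple sampling-distribution formula.

```{r}
# Sketch: the bootstrap sampling plan for the sample median (base R only).
# Assumption: `x_obs` is a made-up skewed sample of size 40.
set.seed(4)
x_obs  <- rexp(40, rate = 1 / 10)
n_boot <- 2000                      # number of bootstrap resamples

theta_hat <- median(x_obs)          # statistic computed on the original sample

# For each resample: draw n values with replacement, then recompute the statistic.
boot_medians <- replicate(
  n_boot,
  median(sample(x_obs, size = length(x_obs), replace = TRUE))
)

# The collection of recomputed medians is the bootstrap sampling distribution.
print(theta_hat)
print(summary(boot_medians))
```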

The bootstrap assumes the sample is representative of the population: if the sample is biased, the bootstrap distribution will be biased too. Once again, observations must be independent and identically distributed. The sample size must also be reasonably large, because an extremely small sample will not represent the population well. Bootstrap distributions can be used for the same inferences as asymptotic distributions (confidence intervals, hypothesis tests, etc.).

The bootstrap is commonly used when working with smaller sample sizes or complex statistics where the sampling distribution is unknown or difficult to derive. It is also useful for understanding how much your statistic would fluctuate from sample to sample. Since you repeatedly compute the statistic on many resampled datasets, the standard deviation of the bootstrap statistics provides an estimate of the statistic’s variability. This value is interpreted as the bootstrap estimate of the standard error, which tells you how stable your statistic is.
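
Here is a minimal sketch of the bootstrap standard error, using the sample mean so it can be compared with the textbook formula $s/\sqrt{n}$. The sample `y_obs` is simulated and hypothetical, and the percentile interval at the end is just one common way to turn a bootstrap distribution into a confidence interval.

```{r}
# Sketch: bootstrap standard error and a percentile interval (base R only).
# Assumption: `y_obs` is a made-up sample of 30 measurements.
set.seed(5)
y_obs <- rnorm(30, mean = 50, sd = 8)

# sample(y_obs, replace = TRUE) draws length(y_obs) values with replacement.
boot_y_means <- replicate(5000, mean(sample(y_obs, replace = TRUE)))

se_boot    <- sd(boot_y_means)                # bootstrap estimate of the standard error
se_formula <- sd(y_obs) / sqrt(length(y_obs)) # formula-based standard error of the mean
print(round(c(se_boot = se_boot, se_formula = se_formula), 3))

# 95% percentile interval: middle 95% of the bootstrap statistics.
pct_ci <- quantile(boot_y_means, probs = c(0.025, 0.975))
print(round(pct_ci, 2))
```

For the mean the two standard errors should nearly agree, which is a useful sanity check before applying the same recipe to a statistic with no formula.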

Overall, both asymptotic and bootstrap methods approximate the sampling distribution of a statistic in order to perform inference. Asymptotic methods rely on theoretical large sample results, while bootstrap methods rely on resampling from observed data. Each approach has its advantages, and understanding their assumptions allows us to choose the appropriate method for a given statistical problem.

## **Question 2: Daily Coffee Sales (in mL) at Two Different Cafe Locations**

This data set represents the volume of regular brewed coffee sold per day (in milliliters) at two different cafe locations over a period of 50 days. 

```
2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500
```
We are interested in finding the sampling distribution of sample means that will be used for various inferences about the underlying population mean.

a) Based on the given data, can the Central Limit Theorem be used to derive the asymptotic sampling distribution of the sample mean? Justify your answer.

The Central Limit Theorem states that the sampling distribution of the sample mean becomes approximately normal as the sample size becomes large, regardless of the population's distribution (provided its variance is finite). We don't know the population's distribution, which is fine, and the sample size of n = 55 is large enough. However, the data must also be independent and identically distributed. The prompt says the values are daily coffee sales at two different cafe locations over a period of 50 days, yet there are 55 observations. If each observation comes from one of the two cafes, then some observations come from one distribution and the rest from another, and the data are not identically distributed. If instead each observation is the combined sales of both cafes, then the data would be identically distributed. For the sake of the problem, I'm going to assume the second case, so the Central Limit Theorem can be used to derive the asymptotic sampling distribution of the sample mean.
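
The concern about mixed sources can be looked at directly in the data. The sketch below counts the observations and splits them at 6000 mL; that cut-off is my own eyeballed assumption for separating the two apparent clusters, not something given in the prompt.

```{r}
# Sketch: quick look at the sample size and the two apparent clusters (base R only).
sales <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
           3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
           8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
           8400, 3300, 4200, 4500, 4800, 4300, 8500)

n_sales <- length(sales)   # number of observations actually supplied
print(n_sales)

# Assumption: 6000 mL is an eyeballed cut-off between low-volume and high-volume days.
cluster <- ifelse(sales > 6000, "high", "low")
print(table(cluster))
print(round(tapply(sales, cluster, mean)))
```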

b) Apply the bootstrap method to estimate the sampling distribution (often called the bootstrap sampling distribution) of the sample mean. Generate a kernel density estimate from the bootstrap sample means and plot it. Then, use this bootstrap distribution to validate your conclusion from part (a). Make sure your visuals are effective in enhancing the presentation of these results.
```{r}
data <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500)

# Draw B bootstrap resamples of `data` and return `statistic` evaluated on each one.
bootstrap <- function(data, statistic, B) {
  n <- length(data)
  stats <- numeric(B)                             # preallocate the B bootstrap statistics
  for (i in seq_len(B)) {
    new_data <- sample(data, n, replace = TRUE)   # resample n values with replacement
    stats[i] <- statistic(new_data)               # statistic for this resample
  }
  stats
}
  
statistic <- function (data){
  mean(data)
}

set.seed(2026) # arbitrary seed so the resampling results are reproducible
stats <- bootstrap(data, statistic, 5000) # 5000 bootstrap sample means

F_r <- density(stats) # kernel density estimate of the bootstrap sample means

plot_df <- data.frame(
  t = F_r$x,
  Boot = F_r$y
)

muhat <- mean(data)
sehat <- sd(data) / sqrt(length(data))

plot_df$Asymptotic <- dnorm(plot_df$t, mean = muhat, sd = sehat) #adding the asymptotic sampling distribution

kde_plt <- ggplot(plot_df, aes(x = t)) +
  geom_line(aes(y = Asymptotic, color = "Asymptotic"), linewidth = 1, linetype = "dashed") +
  geom_line(aes(y = Boot, color = "Bootstrap"), linewidth = 1) +
  labs(
    title = "Bootstrap vs Asymptotic Sampling Distribution (Mean)",
    x = "Sample mean (mL)",
    y = "Density",
    color = "Method"
  ) +
  scale_color_manual(values = c(Asymptotic = "red", Bootstrap = "blue")) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)
```
The KDE of the bootstrap sample means appears symmetric and unimodal, suggesting that the sampling distribution of the mean is approximately normal. The bootstrap distribution is nearly identical to the asymptotic distribution. Therefore, using the Central Limit Theorem to derive the asymptotic sampling distribution of the sample mean is reasonable, which supports the conclusion in part (a).
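
The visual comparison can be backed up with numbers. This sketch recomputes the bootstrap means with `replicate()` and compares the centre, standard error, and a few quantiles against the normal approximation; the seed and the choice of quantiles are my own assumptions.

```{r}
# Sketch: numeric comparison of the bootstrap and asymptotic distributions of the mean.
sales <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
           3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
           8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
           8400, 3300, 4200, 4500, 4800, 4300, 8500)

set.seed(6)
sales_boot_means <- replicate(5000, mean(sample(sales, replace = TRUE)))

mu_hat <- mean(sales)
se_hat <- sd(sales) / sqrt(length(sales))   # CLT standard error of the mean

# Centre and spread under each approach.
mean_comparison <- rbind(
  bootstrap  = c(centre = mean(sales_boot_means), se = sd(sales_boot_means)),
  asymptotic = c(centre = mu_hat, se = se_hat)
)
print(round(mean_comparison, 1))

# Selected quantiles: bootstrap percentiles vs normal quantiles.
probs <- c(0.025, 0.5, 0.975)
print(round(rbind(
  bootstrap  = quantile(sales_boot_means, probs),
  asymptotic = qnorm(probs, mean = mu_hat, sd = se_hat)
), 1))
```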

c) Repeat the analysis in parts (a) and (b) for the sample variance.

The asymptotic result for the sample variance states that, for a sufficiently large sample size (which we have) and a population with a finite fourth moment, the sampling distribution of the sample variance approaches a normal distribution. Under the i.i.d. assumption made in part (a), this result should be applicable.
```{r}
data <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500)

# Same resampling function as in part (b): B resamples, one statistic per resample.
bootstrap <- function(data, statistic, B) {
  n <- length(data)
  stats <- numeric(B)
  for (i in seq_len(B)) {
    new_data <- sample(data, n, replace = TRUE)
    stats[i] <- statistic(new_data)
  }
  stats
}
  
statistic <- function (data){ # same procedure as in part (b), but the statistic is now the variance
  var(data)
}

stats <- bootstrap(data, statistic, 5000)

F_r <- density(stats)

plot_df <- data.frame(
  t = F_r$x,
  R_density = F_r$y
)

kde_plt <- ggplot(plot_df, aes(x = t, y = R_density)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Bootstrap Sample Variance KDE",
    x = "Sample variance (mL^2)",
    y = "Density"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(kde_plt)
```
The KDE of the bootstrap sample variances appears symmetric and unimodal, suggesting that the sampling distribution of the sample variance is approximately normal. Therefore, using the asymptotic normal result to derive the sampling distribution of the sample variance is reasonable.
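
To mirror the comparison made for the mean, the asymptotic standard error of $S^2$ can be estimated by plugging sample moments into $(\mu_4 - \sigma^4)/n$ and set against the bootstrap standard error. This is a sketch under my own assumptions: the plug-in estimate of $\mu_4$, the seed, and the skewness check are not part of the original analysis.

```{r}
# Sketch: bootstrap vs plug-in asymptotic standard error of the sample variance.
sales <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
           3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
           8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
           8400, 3300, 4200, 4500, 4800, 4300, 8500)

n_sales <- length(sales)
s2_hat  <- var(sales)                            # sample variance S^2
mu4_hat <- mean((sales - mean(sales))^4)         # plug-in fourth central moment
se_asym <- sqrt((mu4_hat - s2_hat^2) / n_sales)  # sqrt((mu_4 - sigma^4) / n)

set.seed(7)
sales_boot_vars <- replicate(5000, var(sample(sales, replace = TRUE)))

var_comparison <- rbind(
  bootstrap  = c(centre = mean(sales_boot_vars), se = sd(sales_boot_vars)),
  asymptotic = c(centre = s2_hat, se = se_asym)
)
print(round(var_comparison))

# Sample skewness of the bootstrap variances; values near 0 are consistent with symmetry.
boot_skew <- mean((sales_boot_vars - mean(sales_boot_vars))^3) / sd(sales_boot_vars)^3
print(round(boot_skew, 3))
```

If the two standard errors are close and the skewness is near zero, that supports the normal approximation; a clearly nonzero skewness would suggest relying on the bootstrap distribution instead.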